ON THE USE OF RIDGE AND STEIN-TYPE ESTIMATORS IN PREDICTION

Alan E. Gelfand


1. Introduction

For the usual regression model with fixed regressors,

$Y = X\beta + \varepsilon$, $Y$ $n \times 1$, $X$ $n \times p$ of full rank, $\beta$ $p \times 1$ and $\varepsilon$ $n \times 1 \sim (0, \sigma^2 I)$,

there is considerable literature devoted to alternatives to
the ordinary least squares estimator, $\hat\beta_{OLS}$, of $\beta$. From work
originally dating to Stein (1956) and James and Stein (1961),
when $\varepsilon$ is normally distributed and $p \ge 3$, $\hat\beta_{OLS}$ is inadmissible
under loss $(\hat\beta-\beta)^T Q (\hat\beta-\beta)$, $Q$ an unrestricted positive definite
matrix. Thus, much of this extant discussion focuses on the
development of biased estimators with small "variances" which
achieve a smaller expected loss either uniformly over p-dimensional
Euclidean space or at least in the vicinity of some specified
$\beta^*$. Two "classes" of such reduced variance regression
estimators are particularly well discussed: ridge estimators
and Stein-type estimators. Either directly or upon orthogonal
transformation these estimators take the form


(1)    $\hat\beta_C = C\hat\beta_{OLS} + (I - C)\beta^*$


where $C$ is a diagonal matrix, usually data dependent. They may
also be seen to be Bayes or "empirical" Bayes procedures.
The review paper by Draper and Van Nostrand (1979) provides an
excellent summary of both the theoretical and simulated effort
in this area. In the context of cross-validation, i.e. of
examining the performance of an estimator obtained in one
sample in prediction in a second independent sample, the work
of Stone (1974) leads to estimators of the form in (1) as well.
Herein we consider the simplest such cross-validation
problem. At a new vector of predictor values, $X_0$, we seek to
estimate $X_0^T\beta$. We take as loss function $(\delta(Y) - X_0^T\beta)^2$ for an
estimator $\delta(Y)$ and we assume henceforth that $\varepsilon$ is normally
distributed with $\sigma^2$ unknown. Our problem differs from that of
estimating the vector $\beta$ since the results of Cohen (1965) show
that $aX_0^T\hat\beta_{OLS}$ is an admissible estimator of $X_0^T\beta$ for $0 \le a \le 1$,
i.e. the UMVU estimator is admissible. (In fact, $\delta(Y)$ of the
form $\gamma^TY$ is admissible for $X_0^T\beta$ if and only if

$(2\gamma - X(X^TX)^{-1}X_0)^T(2\gamma - X(X^TX)^{-1}X_0) \le X_0^T(X^TX)^{-1}X_0$.)

Nonetheless, if we have some confidence in $\beta^*$, i.e. that $\beta^*$ is
near the true value $\beta$, then it makes sense to attempt to improve
upon $X_0^T\hat\beta_{OLS}$ in the "vicinity of $\beta^*$" using estimators of the
form (1). More specifically, how well do the "classes" of ridge
estimators and of Stein-type estimators perform in this prediction?
Can we make a "best" choice within these classes for a particular
prediction?

The problem of prediction of an independent observation $Y_0$
at $X_0$ using the loss $(\delta(Y) - Y_0)^2$ is equivalent to that of
predicting $X_0^T\beta$, i.e. $E_\beta(\delta(Y) - Y_0)^2 = \sigma^2 + E_\beta(\delta(Y) - X_0^T\beta)^2$.

For an estimator of the form $X_0^T\hat\beta$, the expected loss becomes

(2)    $E_\beta(\hat\beta-\beta)^T X_0X_0^T(\hat\beta-\beta) = E_\beta\left[\sum_i X_{0i}(\hat\beta_i-\beta_i)\right]^2$.

In the sequel we take the generalized ridge estimator $\hat\beta_R$ to be

(3)    $\hat\beta_R = (X^TX + A)^{-1}(X^TY + A\beta^*)$

where $A$ is p.d. symmetric and possibly dependent on $Y$. We take
the general Stein-type estimator $\hat\beta_S$ to be

(4)    $\hat\beta_S = (1 - c/Q)\hat\beta_{OLS} + (c/Q)\beta^*$

where $Q = (\hat\beta_{OLS}-\beta^*)^TX^TX(\hat\beta_{OLS}-\beta^*)$ and $c$ may depend on $Y$. In
practice $c/Q$ is usually replaced by $\min(c/Q, 1)$.
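
As a concrete reading of (3) and (4), the following Python sketch computes both estimators for given data. The function names, the use of NumPy, the simulated data, and the choice to always apply the $\min(c/Q, 1)$ truncation are illustrative assumptions and are not part of the original development.

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares estimate (X'X)^{-1} X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def generalized_ridge(X, y, A, beta_star):
    """Generalized ridge estimate (3): (X'X + A)^{-1} (X'y + A beta*)."""
    return np.linalg.solve(X.T @ X + A, X.T @ y + A @ beta_star)

def stein_type(X, y, c, beta_star):
    """Stein-type estimate (4), with c/Q truncated to min(c/Q, 1)."""
    b = ols(X, y)
    d = b - beta_star
    Q = d @ (X.T @ X) @ d            # Q = (b - beta*)' X'X (b - beta*)
    w = min(c / Q, 1.0)              # weight placed on beta*
    return (1.0 - w) * b + w * beta_star

# Illustrative data (hypothetical values, not from the paper).
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = X @ np.array([1.0, -0.5, 0.25]) + rng.standard_normal(20)
beta_star = np.zeros(3)

b_ols = ols(X, y)
b_ridge = generalized_ridge(X, y, 2.0 * (X.T @ X), beta_star)
b_stein = stein_type(X, y, 1.0, beta_star)
```

With $A = aX^TX$ the ridge estimate reduces to $(\hat\beta_{OLS} + a\beta^*)/(1+a)$, which is form (1) with $C = (1+a)^{-1}I$; the sketch uses $a = 2$.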

In section 2 we calculate the risk, (2), of the estimators
(3) and (4) when $A$, $c$ are constant. We then investigate "best"
choices for $A$, $c$. Since these choices will be functions of $\beta$
and $\sigma^2$ as well as $X_0$, $A$ and $c$ must be estimated from $Y$. In section
3 we summarize a simulation study which compares the performance
of versions of (3) and (4) which are discussed for the estimation
of $\beta$ along with others motivated by work in section 2. In section
4 we offer concluding remarks, in particular with regard to
multiple prediction.


Note:  Normality  is  not  employed  in  this  calculation. 


In (3) $A$ is usually taken to be diagonal and, in fact,
the class of ridge (as opposed to generalized ridge) estimators
sets $A = aI$, $a \ge 0$. The case where either by design or
transformation $X^TX = I$ reduces (5) to $\sigma^2\sum_i X_{0i}^2$ and reduces (6), for
generalized ridge estimators ($a_i$ are the diagonal elements of
the diagonal matrix $A$) to

(7)    $\sigma^2\sum_i \dfrac{X_{0i}^2}{(1+a_i)^2} + \left[\sum_i X_{0i}(\beta_i-\beta_i^*)\dfrac{a_i}{1+a_i}\right]^2$.


Investigation of this expression reveals that an optimal
choice for the $a_i$ to minimize (7) needn't exist although local
minima can be found. In the case of ridge estimation, i.e.
all $a_i = a$, a unique minimum can be found. This occurs at

(8)    $a_0 = \sigma^2\gamma^{-2}X_0^TX_0$

where $\gamma = X_0^T(\beta-\beta^*)$. Note that $a_0 > 0$ and finite provided $\beta-\beta^*$
isn't orthogonal to $X_0$. The associated minimum equals

$\dfrac{\sigma^2\gamma^2X_0^TX_0}{\gamma^2 + \sigma^2X_0^TX_0}$.


When $\beta$ is such that $\beta-\beta^*$ is orthogonal to $X_0$, then $X_0^T\beta^*$ predicts
perfectly. For such $\beta$'s we can obtain zero expected loss and
would want no weight attached to $\hat\beta_{OLS}$, i.e. would want $a = \infty$.
In fact, it is clear that for $X_0$, $\beta$ fixed there will be a set
of $\beta'$'s which predict $X_0^T\beta$ perfectly and that $\beta'$ needn't be
close to $\beta$ in Euclidean distance. Thus the appropriate
pseudometric for the prediction problem is $d(\beta,\beta') = |X_0^T(\beta-\beta')|$. This
pseudometric clarifies the earlier notion of "vicinity of $\beta^*$"
and under this distance the further $\beta^*$ is from $\beta$ the closer $a_0$
is to 0, i.e. the more weight is placed on $\hat\beta_{OLS}$; the closer $\beta^*$
is to $\beta$ the larger $a_0$ becomes, i.e. the more weight is placed
on $\beta^*$. As would be expected, $a_0$ is invariant to scaling of $X_0$,
although the risk clearly isn't.

Using (8) our estimator of $X_0^T\beta$ is

$T_{a_0} = (1+a_0)^{-1}X_0^T\hat\beta_{OLS} + (1+a_0)^{-1}a_0X_0^T\beta^*$

and, in fact, for any fixed $a > 0$, $T_a$ improves upon $X_0^T\hat\beta_{OLS}$
whenever $\gamma^2 < \sigma^2X_0^TX_0\,a^{-1}(2+a)$.
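
The ridge calculations above can be checked numerically. The sketch below, in Python, evaluates the risk (7) with all $a_i = a$, the minimizer (8), the associated minimum, and the improvement condition just stated. The function names and the numerical values of $\gamma$ and $\tau^2 = \sigma^2X_0^TX_0$ are assumptions chosen only for illustration.

```python
import numpy as np

def ridge_prediction_risk(a, gamma, tau2):
    """Risk (7) with all a_i = a: (tau^2 + a^2 gamma^2) / (1 + a)^2,
    where tau2 = sigma^2 * X0'X0 and gamma = X0'(beta - beta*)."""
    return (tau2 + a**2 * gamma**2) / (1.0 + a) ** 2

def optimal_a(gamma, tau2):
    """Minimizer (8): a_0 = tau^2 / gamma^2 (requires gamma != 0)."""
    return tau2 / gamma**2

def minimum_risk(gamma, tau2):
    """Risk attained at a_0: tau^2 gamma^2 / (gamma^2 + tau^2)."""
    return tau2 * gamma**2 / (gamma**2 + tau2)

def improves_on_ols(a, gamma, tau2):
    """True when gamma^2 < tau^2 (2 + a) / a, i.e. T_a beats the OLS predictor."""
    return gamma**2 < tau2 * (2.0 + a) / a

gamma, tau2 = 0.8, 1.5                     # hypothetical values
a0 = optimal_a(gamma, tau2)
grid = np.linspace(0.0, 50.0, 5001)        # candidate values of a
grid_risks = ridge_prediction_risk(grid, gamma, tau2)
```

At $a = 0$ the risk is $\tau^2$, the OLS risk, and no value on the grid falls below the closed-form minimum.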

From (8) a convenient estimator of $a_0$ is

(9)    $\hat a_0 = \hat\sigma^2\hat\gamma^{-2}X_0^TX_0$

where $\hat\sigma^2$ is the usual UMVU estimator of $\sigma^2$ and $\hat\gamma = X_0^T(\hat\beta_{OLS}-\beta^*)$.
The fact that $E\hat\gamma^{-2}$ doesn't exist suggests that $\hat a_0$ will be very
unstable and that $T_{\hat a_0}$ will perform poorly. We return to this
point in the discussion of the simulation study. Since $\hat\sigma^2$ is


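independent of $\hat\beta_{OLS}$ and $\hat a_0$ depends on $\hat\beta_{OLS}$ only through $X_0^T\hat\beta_{OLS}$,
we may compute the expected loss for $T_{\hat a_0}$. If $\hat\tau^2 = \hat\sigma^2X_0^TX_0$, then
$\hat\gamma \sim N(\gamma, \tau^2)$ and, with $\tau^2 = \sigma^2X_0^TX_0$, (2) for $T_{\hat a_0}$ becomes

(10)   $E_\gamma\left(\dfrac{\hat\gamma^3}{\hat\gamma^2+\hat\tau^2} - \gamma\right)^2 = \tau^2 + E_\gamma\dfrac{\hat\tau^4\hat\gamma^2}{(\hat\gamma^2+\hat\tau^2)^2} - 2\tau^2E_\gamma\dfrac{\hat\tau^2(\hat\tau^2-\hat\gamma^2)}{(\hat\gamma^2+\hat\tau^2)^2}$.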


The equality (10) is seen using the identity $E_\gamma f(\hat\gamma)(\hat\gamma-\gamma)
= \tau^2E_\gamma f'(\hat\gamma)$ (Stein (1973)), valid provided $E_\gamma|f'(\hat\gamma)| < \infty$ (which,
as the following calculations show, is the case). Now

~2  2  2  2  2  *2  2  2 
y  /t  -  X.  ,  y  /2t  independent  of  x  /x  -  X  .  Hence 
a  *  n— p 

(Y2+T2)"1T2jL~Be(J1|£-,  where  L  -  Po(y2/2T2).  The  expectation 

of  each  term  in  (10)  can  thus  be  evaluated  and  (10)  becomes 


(11)  x2(L,(n-p)E(n-p.2^3)-1[^><2^1>  -  g^|^3). 


If we divide (11) by $\tau^2$, i.e. consider the risk relative
to that of $X_0^T\hat\beta_{OLS}$, then this relative risk is a function of
$\gamma^2/\tau^2$. Hence we set $\tau^2 = 1$ and examine the simpler estimator
$(\hat\gamma^2+1)^{-1}\hat\gamma^3$, which may be thought of as an "empirical" Bayes estimator
against a normal prior centered at 0, adjusted to have no singularities
in $R^1$. The risk of this estimator is readily obtained to be
$1 + E_\gamma(\hat\gamma^2+1)^{-2}(3\hat\gamma^2-2)$ by an argument similar to that leading to
(10). This risk (symmetric about 0) is graphed in Figure 1 against
$\gamma > 0$ to illustrate what may be expected, up to scaling, if (11)
is evaluated. Note that the risk is bounded and considerably
less than 1 for $\gamma$ small. Because $(\hat\gamma^2+1)^{-1}\hat\gamma^3$ has singularities
in the complex plane it is not admissible.
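
The risk expression for $(\hat\gamma^2+1)^{-1}\hat\gamma^3$ can be verified numerically. The Python sketch below computes the squared-error risk directly and via the expression $1 + E_\gamma(\hat\gamma^2+1)^{-2}(3\hat\gamma^2-2)$, using a simple Riemann sum against the $N(\gamma, 1)$ density. The grid settings, function names, and the values of $\gamma$ examined are assumptions made for illustration.

```python
import numpy as np

def normal_expectation(func, gamma, half_width=12.0, n=48001):
    """E[func(g)] for g ~ N(gamma, 1), by a Riemann sum on a fine grid."""
    g = np.linspace(gamma - half_width, gamma + half_width, n)
    dx = g[1] - g[0]
    pdf = np.exp(-0.5 * (g - gamma) ** 2) / np.sqrt(2.0 * np.pi)
    return np.sum(func(g) * pdf) * dx

def risk_direct(gamma):
    """E[(g^3 / (g^2 + 1) - gamma)^2], the squared-error risk itself."""
    return normal_expectation(lambda g: (g**3 / (g**2 + 1.0) - gamma) ** 2, gamma)

def risk_formula(gamma):
    """1 + E[(3 g^2 - 2) / (g^2 + 1)^2], the expression quoted in the text."""
    return 1.0 + normal_expectation(
        lambda g: (3.0 * g**2 - 2.0) / (g**2 + 1.0) ** 2, gamma)

gammas = [0.0, 0.5, 1.0, 2.0, 4.0]          # hypothetical values of gamma
direct = [risk_direct(g) for g in gammas]
formula = [risk_formula(g) for g in gammas]
```

The two computations should agree to numerical precision, and the value at $\gamma = 0$ falls below 1, the risk of the OLS predictor in this scaling.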

If we restore $X^TX$, not necessarily diagonal, our estimator
in (3) has $A = a_0X^TX$ or $\hat a_0X^TX$ according to (8) or (9).

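Theorem 2: For $\hat\beta_S$ as in (4) with $p > 2$, (2) becomes

(12)   $\sigma^2X_0^T(X^TX)^{-1}X_0 + X_0^T(X^TX)^{-1}X_0\left[(c^2+4c\sigma^2)\Gamma_1/\sigma^2 - 2c\Gamma_2\right]$

where

$\Gamma_1 = E\dfrac{2L+1}{(p+2(L+M))(p-2+2(L+M))}$,   $\Gamma_2 = E\dfrac{1}{p-2+2(L+M)}$

with

(13)   $L \sim Po(\lambda)$,  $\lambda = \gamma^2/2\sigma^2X_0^T(X^TX)^{-1}X_0$
       $M \sim Po(\delta)$,  $\delta = (\Delta X_0^T(X^TX)^{-1}X_0 - \gamma^2)/2\sigma^2X_0^T(X^TX)^{-1}X_0$

where $L$, $M$ are independent and $\Delta = (\beta-\beta^*)^TX^TX(\beta-\beta^*)$.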

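Proof: As in Theorem 1, let $R$ be nonsingular such that
$RX^TXR^T = I$ and let $\hat\alpha = (R^{-1})^T(\hat\beta_{OLS}-\beta^*)$. Then $\hat\alpha \sim N(\alpha, \sigma^2I)$ with
$\alpha = (R^{-1})^T(\beta-\beta^*)$, $Q = \hat\alpha^T\hat\alpha$, and (2) becomes

(14)   $E_\alpha\left[w^T\left((1 - c/\hat\alpha^T\hat\alpha)\hat\alpha - \alpha\right)\right]^2$

where $w = RX_0$ (and $w^Tw = X_0^T(X^TX)^{-1}X_0$).
If we expand (14) we obtain

(15)   $E_\alpha(w^T(\hat\alpha-\alpha))^2 + c^2E_\alpha\dfrac{(w^T\hat\alpha)^2}{(\hat\alpha^T\hat\alpha)^2} - 2cE_\alpha\dfrac{w^T(\hat\alpha-\alpha)\,w^T\hat\alpha}{\hat\alpha^T\hat\alpha}$.

The last term may be written as $-2c\sum_i w_iE_\alpha f(\hat\alpha)(\hat\alpha_i-\alpha_i)$ where
$f(\hat\alpha) = (\hat\alpha^T\hat\alpha)^{-1}(w^T\hat\alpha)$. Using the Stein identity
($\sigma^2E_\alpha\,\partial f/\partial\hat\alpha_i = E_\alpha f(\hat\alpha)(\hat\alpha_i-\alpha_i)$, which is valid here) on this
expression, after manipulation (15) becomes

(16)   $\sigma^2w^Tw + (c^2+4c\sigma^2)E_\alpha\dfrac{(w^T\hat\alpha)^2}{(\hat\alpha^T\hat\alpha)^2} - 2c\sigma^2w^TwE_\alpha\dfrac{1}{\hat\alpha^T\hat\alpha}$.

Finally if we let $U = (w^T\hat\alpha)^2/w^Tw$, $V = \hat\alpha^T\hat\alpha - (w^T\hat\alpha)^2/w^Tw$, then

$U/\sigma^2 \mid L \sim \chi^2_{1+2L}$ with $L$ as in (13),

$V/\sigma^2 \mid M \sim \chi^2_{p-1+2M}$ with $M$ as in (13),

and given $L$ and $M$, $(U+V)/\sigma^2 \sim \chi^2_{p+2(L+M)}$, independent of
$U/(U+V) \sim Be\left(\frac{1+2L}{2}, \frac{p-1+2M}{2}\right)$. Hence

$E_\alpha\dfrac{(w^T\hat\alpha)^2}{(\hat\alpha^T\hat\alpha)^2} = w^TwE_\alpha\left[E\left(\dfrac{U}{U+V}\,\Big|\,L,M\right)E\left(\dfrac{1}{U+V}\,\Big|\,L,M\right)\right] = \dfrac{w^Tw}{\sigma^2}E\dfrac{2L+1}{(p+2(L+M))(p-2+2(L+M))} = \dfrac{w^Tw}{\sigma^2}\Gamma_1$

and similarly

$E_\alpha\dfrac{\sigma^2}{\hat\alpha^T\hat\alpha} = \Gamma_2$.

Making these substitutions into (16) and restoring $X_0$ we obtain
(12). $\Box$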

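Note: The proof reveals that the expected loss, (2), for more general
estimators of the form $(1-h(Q))\hat\beta_{OLS} + h(Q)\beta^*$ can be developed. In
fact, if $E_\beta|\partial h(Q)/\partial\hat\beta_{OLS,i}| < \infty$, the loss is

(17)   $\sigma^2X_0^T(X^TX)^{-1}X_0 + E_\beta h^2(Q)\hat\gamma^2 - 2\sigma^2X_0^T(X^TX)^{-1}X_0E_\beta h(Q) - 4\sigma^2E_\beta h'(Q)\hat\gamma^2$.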


Inspection of (12) reveals that the unique best $c$ is

(18)   $c_0 = \sigma^2(\Gamma_2/\Gamma_1 - 2)$.

Since $\lambda$, $\delta$ are invariant to scaling of $X_0$ so is $c_0$. It is
apparent that for $X_0$ fixed as $\Delta \to 0$, $\Gamma_2/\Gamma_1 \to p$, i.e.
$c_0 \to \sigma^2(p-2)$. Hence if we believe $\beta^*$ is close to $\beta$ the
usual constant, $\sigma^2(p-2)$, may be employed. Using this constant,
if $\Delta$ is near 0, the relative risk of $X_0^T\hat\beta_S$ to $X_0^T\hat\beta_{OLS}$ will, from
(12), be near $2/p$, as it is in the case of estimating $\beta$. As in
remarks after (10), if $\hat\sigma^2$ is the usual estimator of $\sigma^2$ independent
of $\hat\beta_{OLS}$ we may compute (2) for $\hat\beta_S$ as in (4) with $c = \hat\sigma^2(p-2)$.
We obtain

(19)   $\sigma^2X_0^T(X^TX)^{-1}X_0\left(1 + \Gamma_1(p^2-4+2(n-p)^{-1}(p-2)^2) - 2\Gamma_2(p-2)\right)$.
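
The quantities in Theorem 2 are expectations over independent Poisson variables and so are easy to evaluate by truncated sums. The Python sketch below computes $\Gamma_1$, $\Gamma_2$, the risk (12), the best constant (18) and the risk (19). The truncation rule, the function names, and the example values of $p$, $\lambda$ and $\delta$ are assumptions made for illustration, and `w2` stands for $X_0^T(X^TX)^{-1}X_0$.

```python
import math

def poisson_pmf(mean, kmax):
    """[P(K = 0), ..., P(K = kmax)] for K ~ Poisson(mean)."""
    pmf = [math.exp(-mean)]
    for k in range(1, kmax + 1):
        pmf.append(pmf[-1] * mean / k)
    return pmf

def gamma_constants(p, lam, delta):
    """Gamma_1 and Gamma_2 of Theorem 2 via truncated Poisson sums."""
    big = max(lam, delta)
    kmax = int(big + 12.0 * math.sqrt(big) + 30)   # ad hoc truncation point
    pl, pm = poisson_pmf(lam, kmax), poisson_pmf(delta, kmax)
    g1 = g2 = 0.0
    for l, wl in enumerate(pl):
        for m, wm in enumerate(pm):
            n2 = 2 * (l + m)
            g1 += wl * wm * (2 * l + 1) / ((p + n2) * (p - 2 + n2))
            g2 += wl * wm / (p - 2 + n2)
    return g1, g2

def stein_prediction_risk(c, sigma2, w2, p, lam, delta):
    """Risk (12) for constant c."""
    g1, g2 = gamma_constants(p, lam, delta)
    return sigma2 * w2 + w2 * ((c**2 + 4.0 * c * sigma2) * g1 / sigma2
                               - 2.0 * c * g2)

def best_c(sigma2, p, lam, delta):
    """Optimal constant (18): sigma^2 (Gamma_2 / Gamma_1 - 2)."""
    g1, g2 = gamma_constants(p, lam, delta)
    return sigma2 * (g2 / g1 - 2.0)

def risk_with_estimated_sigma2(sigma2, w2, p, n, lam, delta):
    """Risk (19) for c = sigma_hat^2 (p - 2)."""
    g1, g2 = gamma_constants(p, lam, delta)
    return sigma2 * w2 * (1.0 + g1 * (p**2 - 4 + 2.0 * (p - 2) ** 2 / (n - p))
                          - 2.0 * g2 * (p - 2))

p = 5
rel_risk_at_zero = stein_prediction_risk(p - 2, 1.0, 1.0, p, 0.0, 0.0)
c_star = best_c(1.0, p, 0.7, 1.3)          # hypothetical lambda and delta
```

With $\lambda = \delta = 0$ the sums collapse to a single term, so the relative risk $2/p$ and the limit $c_0 = \sigma^2(p-2)$ quoted above can be read off directly.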

Expressions similar to (19) can be obtained for instances
of the more general estimators mentioned above (17). This suggests
that the risk (2) may be calculated for commonly used (adaptive)
ridge estimators, e.g. those considered in section 3. However,
without restrictions on the design matrix $X$, these estimators
often fail to either provide $\hat a$ in closed form or to define
$\hat a$ as a function of $Q$.


Since to a first order approximation $c_0 \approx \sigma^2(2\lambda+1)^{-1}(p-2+2(\delta-\lambda))$
we may estimate $c_0$ by

(20)   $\hat c_0 = \hat\sigma^2(2\hat\lambda+1)^{-1}(p-2+2(\hat\delta-\hat\lambda))$

where $\hat\lambda$, $\hat\delta$ are the expressions in (13) with $\beta$ replaced by $\hat\beta_{OLS}$.
We would truncate $\hat c_0$ to the interval $[0, Q]$. Since $E\hat\lambda^{-1}$ doesn't
exist $\hat c_0$ will be very unstable (as with $\hat a_0$ in (9)) suggesting
that the resulting predictor will perform poorly. Again we
return to this point in the next section.
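
A plug-in computation of (20) might look as follows in Python. This is a sketch under stated assumptions: $\hat\lambda$ and $\hat\delta$ are formed by substituting $\hat\beta_{OLS}$ for $\beta$ and $\hat\sigma^2$ for $\sigma^2$ in (13), so that $\Delta$ is replaced by $Q$; the function name and the small orthonormal example are hypothetical.

```python
import numpy as np

def c0_hat(X, x0, b_ols, beta_star, sigma2_hat):
    """Estimate (20) of the optimal Stein constant, truncated to [0, Q]."""
    p = X.shape[1]
    XtX = X.T @ X
    d = b_ols - beta_star
    w2 = x0 @ np.linalg.solve(XtX, x0)       # X0'(X'X)^{-1}X0
    gamma_hat = x0 @ d                       # X0'(b_OLS - beta*)
    Q = d @ XtX @ d                          # plug-in value of Delta
    lam_hat = gamma_hat**2 / (2.0 * sigma2_hat * w2)
    delta_hat = (Q * w2 - gamma_hat**2) / (2.0 * sigma2_hat * w2)
    c = sigma2_hat * (p - 2 + 2.0 * (delta_hat - lam_hat)) / (2.0 * lam_hat + 1.0)
    return float(np.clip(c, 0.0, Q))

X = np.eye(3)                                # hypothetical orthonormal design
c_example = c0_hat(X, np.array([1.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0]),
                   np.zeros(3), 1.0)
```

In the example $\hat\lambda = 0.5$ and $\hat\delta = 1$, so the untruncated value is $(1 + 2(0.5))/2 = 1$, which already lies in $[0, Q] = [0, 3]$.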

3.  A  Simulation  Study 

A simulation was conducted to compare the use of the OLS
predictor with the predictors discussed in the previous section
and with predictors arising from other estimators of $\beta$ which
have been discussed in the literature. For convenience we set
$\sigma^2 = 1$ and take $X^TX = I$, i.e. $\hat\beta_{OLS} \sim N(\beta, I)$.² Without loss of
generality we set $\beta^* = 0$ and $X_0^TX_0 = 1$. Under this setup ridge
estimators become $(1+a)^{-1}\hat\beta_{OLS}$ and Stein estimators become
$(1 - c/\hat\beta_{OLS}^T\hat\beta_{OLS})\hat\beta_{OLS}$. In addition to $\hat\beta_{OLS}$ we consider the
following six estimators of $\beta$ (4 ridge type, 2 Stein type).

(i) $\hat\beta_{HK}$ - arising from $\hat a = p/\hat\beta_{OLS}^T\hat\beta_{OLS}$. The ridge estimators
discussed by Hoerl and Kennard (1970), Hoerl, Kennard
and Baldwin (1975) and, in fact, Lawless and Wang (1976)
reduce to $\hat\beta_{HK}$ in our setup.

(ii) $\hat\beta_{RM}$ - arising from $\hat a = p/(\hat\beta_{OLS}^T\hat\beta_{OLS}-p)$ with $(1+\hat a)^{-1} = 0$
if $\hat\beta_{OLS}^T\hat\beta_{OLS} \le p$. The RIDGM and STEINM estimators discussed
by Dempster, Schatzoff, and Wermuth (1977) reduce to $\hat\beta_{RM}$.

(iii) $\hat\beta_{MG}$ - arising from $\hat a = (1 - p/\hat\beta_{OLS}^T\hat\beta_{OLS})^{-1/2}(1 - (1 - p/\hat\beta_{OLS}^T\hat\beta_{OLS})^{1/2})$
with $(1+\hat a)^{-1} = 0$ if $\hat\beta_{OLS}^T\hat\beta_{OLS} \le p$. The ridge estimator
of McDonald and Galarneau (1975) reduces to $\hat\beta_{MG}$.

(iv) $\hat\beta_{\hat a_0}$ - arising from $\hat a_0$ given in (9).

(v) $\hat\beta_{p-2}$ - arising from $c = p-2$, i.e. the "usual" Stein
estimator.

(vi) $\hat\beta_{\hat c_0}$ - arising from $\hat c_0$ given in (20), truncated to $[0, \infty)$.

"Positive part" restrictions were applied to all "shrinkages"
in (v) and (vi).

We note that under the above assumptions the risks of (3)
and (4) and, in fact, of the predictors arising from (i) - (vi)
depend on $X_0$ and $\beta$ only through $(X_0^T\beta)^2$ and $\beta^T\beta$. Since $(X_0^T\beta)^2
= r\beta^T\beta$, $0 \le r \le 1$, we may summarize the results in terms of
$\beta^T\beta$ and $r$. We consider $p = 3, 6, 10$. For a given $p$ we generated
sets of $2p$ independent uniform random variables on the interval
$[-1, 1]$. In each case we considered the first $p$ observations as
a $p$ vector, standardized to length 1 and designated it as an $X_0$.
Similarly the second $p$ observations are considered as a $\beta$ vector
with scaling by .1, 1, 10. Hence we have large $\beta^T\beta$, i.e. $\beta^T\beta \approx 100$,
moderate $\beta^T\beta$, i.e. $\beta^T\beta \approx 1$ and small $\beta^T\beta$, i.e. $\beta^T\beta \approx .01$. For
each $X_0$, $\beta$ pair 1000 $\hat\beta_{OLS}$'s were generated from $N(\beta, I)$ and using
$X_0$ each of the seven predictors was calculated for each of the
1000 replications. Bias, variance and mean square error (MSE)
were estimated. A large number of $X_0$, $\beta$ pairs (approximately 400)
were investigated enabling a wide range of $r$'s. Table 1 provides
a brief summary indicating the best $\hat\beta$ for prediction over ranges
for $r$ along with the typical percentage reduction in risk
(using the best predictor over that range), $100(\mathrm{MSE}\,X_0^T\hat\beta_{OLS}
- \mathrm{MSE}\,X_0^T\hat\beta)/\mathrm{MSE}\,X_0^T\hat\beta_{OLS}$.
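
A scaled-down version of this design can be sketched in Python as follows. It is not the original simulation code: it covers only the OLS predictor and the predictors from (i), (ii), (iii) and (v), treats $\sigma^2 = 1$ as known, and uses a single $X_0$, $\beta$ pair with arbitrary seeds and replication count. The shrinkage multipliers are the values of $(1+\hat a)^{-1}$ and $1 - c/\hat\beta_{OLS}^T\hat\beta_{OLS}$ implied by the definitions above, with the McDonald and Galarneau multiplier taken to be $(1 - p/\hat\beta_{OLS}^T\hat\beta_{OLS})^{1/2}$.

```python
import numpy as np

def shrinkage_factors(b):
    """Multipliers k with estimator k * b_OLS (X'X = I, sigma^2 = 1, beta* = 0).
    Each row of b is one replication of b_OLS."""
    p = b.shape[-1]
    s = np.sum(b * b, axis=-1)                   # b'b for each replication
    ratio = np.clip(1.0 - p / s, 0.0, None)      # (1 - p/s), set to 0 if s <= p
    return {
        "OLS": np.ones_like(s),
        "HK": s / (s + p),                       # 1/(1+a) with a = p/s
        "RM": ratio,                             # 1/(1+a) with a = p/(s-p)
        "MG": np.sqrt(ratio),                    # 1/(1+a) = (1 - p/s)^(1/2)
        "Stein(p-2)": np.clip(1.0 - (p - 2) / s, 0.0, None),  # positive part
    }

def simulate_mse(x0, beta, n_rep=1000, seed=0):
    """Monte Carlo MSE of each predictor of X0'beta."""
    rng = np.random.default_rng(seed)
    b = beta + rng.standard_normal((n_rep, beta.size))   # b_OLS ~ N(beta, I)
    target = x0 @ beta
    ols_pred = b @ x0
    return {name: float(np.mean((k * ols_pred - target) ** 2))
            for name, k in shrinkage_factors(b).items()}

rng = np.random.default_rng(1)
p = 6
u = rng.uniform(-1.0, 1.0, size=2 * p)
x0 = u[:p] / np.linalg.norm(u[:p])      # first p values, standardized to length 1
beta = 0.1 * u[p:]                      # second p values, "small" beta'beta case
r = (x0 @ beta) ** 2 / (beta @ beta)
mse = simulate_mse(x0, beta, n_rep=20000)
reduction = {k: 100.0 * (mse["OLS"] - v) / mse["OLS"] for k, v in mse.items()}
```

Here `reduction` holds the percentage reduction in MSE relative to the OLS predictor, the quantity summarized in Table 1.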

Several  comments  are  appropriate: 

(i) The cross-over points in Table 1 are approximate, but
in the vicinity of the cross-over competing predictors are
indistinguishable with respect to MSE.

(ii) It is not surprising that regardless of $p$, if $\beta^T\beta$ is large
and $r$ is large, the OLS predictor is best. In fact, if $\beta^T\beta$ is large
and $.01 < r < .5$, the percent improvement of the best
predictor over OLS is never greater than 5%.

(iii) As expected, $\hat\beta_{\hat c_0}$, $\hat\beta_{\hat a_0}$ performed very badly, always sixth
or seventh, doing well only when $\beta^T\beta$ is large and $r$ is very
small (regardless of $p$). However, in such cases, improvements
will be substantial, increasing as $r$ decreases, while
the other five predictors are indistinguishable. Near $r = .01$
$\hat\beta_{\hat a_0}$ is best; much below .01, $\hat\beta_{\hat c_0}$ is best.

(iv) $\hat\beta_{p-2}$ is likely the best overall choice, always amongst
the two or three best apart from cases in (iii) above.

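(v) When $\beta^T\beta$ is small or moderate, $\hat\beta_{RM}$, $\hat\beta_{HK}$, $\hat\beta_{p-2}$ and $\hat\beta_{MG}$
were always the best four. When $\beta^T\beta$ is small and $p = 3$,
$\hat\beta_{MG}$ is close in performance to $\hat\beta_{p-2}$. When $\beta^T\beta$ is small
and $p = 6$, $\hat\beta_{RM}$ and $\hat\beta_{MG}$ split for second best. When $\beta^T\beta$
is small and $p = 10$, $\hat\beta_{p-2}$ is second with $\hat\beta_{MG}$ third.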


(vi) When $r$ is large $\hat c_0$ is almost always $< 0$ whence $\hat\beta_{\hat c_0} = \hat\beta_{OLS}$.

Table 1 reveals that in many cases substantial reduction
in squared error loss over the OLS predictor can be achieved.
It further suggests the possibility of selecting the predictor
according to $\beta^T\beta$ and $r$. However, finely detailed selection,
e.g. according to $X_0$, will be unsuccessful as the performance
of $\hat\beta_{\hat c_0}$ and $\hat\beta_{\hat a_0}$ reveals. In practice we will have $\Delta/\sigma^2$ instead of $\beta^T\beta$
and $r = (\Delta X_0^T(X^TX)^{-1}X_0)^{-1}\gamma^2$, and we might define estimators
$\hat\Delta$, $\hat r$ with $\beta$ replaced by $\hat\beta_{OLS}$, $\sigma^2$ replaced by $\hat\sigma^2$. We may calculate
$E(\hat\Delta) = p + \Delta$ whence $\hat\Delta' = \hat\Delta - p$ is UMVU for $\Delta$. By an argument
similar to that contained in Theorem 2, we may show
$E(\hat r) = E(p+2(L+M))^{-1}(2L+1) \approx r + (1-rp)\{(p+\Delta)^{-1} + (p+\Delta)^{-3}2\Delta\}$ where
$L$, $M$ are distributed as in (13). For individual predictions,
preliminary calculation of $\hat\Delta'$ and $\hat r$ should enable a judicious
choice of predictor.

As Thisted and Morris (1980, p. 19) observe, the poorest
estimation case for ridge procedures occurs when (with $X^TX = I$,
$\beta^* = 0$) $\beta^T = (\beta_1, 0, 0, \ldots, 0)$ with $\beta_1^2$ large. This is also the
poorest estimation case for Stein type procedures in the sense
that the first coordinate will account for about half of the
total risk and all coordinate risks would decrease if $\hat\beta_1$ was
excluded (see Baranchik (1964)). For prediction this implies
$\Delta$ large and $r = X_{01}^2/X_0^TX_0$. Hence this is the poorest case for
prediction as well, i.e. depending upon $X_{01}$, improvement will
be small or, in fact, the OLS predictor will be better.


How will multicollinearity in $X^TX$ affect prediction?
Let $X^TX = D$, a diagonal matrix with diagonal elements $d_i$ and
assume $d_1 \le d_2 \le \ldots \le d_p$. Then the extent of multicollinearity
is usually measured in terms of how close $d_1$ is to 0. Although
ridge methods have been advocated for improved estimation when
$d_1$ is quite small, particularly relative to the other $d_i$, Thisted
and Morris (p. 21) and others have established that in this case
optimal ridge as well as Stein-type estimators will produce
inconsequential improvement over $\hat\beta_{OLS}$. For prediction (with
$\beta^* = 0$), $\Delta = \sum_i d_i\beta_i^2$ and $r = (X_0^T\beta)^2/\left((\sum_i X_{0i}^2/d_i)(\sum_i d_i\beta_i^2)\right)$. Bingham and
Larntz (1977, p. 102) observe that (in this notation) the worst
case for ridge estimation occurs when large $\beta_i^2$ are associated
with small $d_i$. This tells us little about the magnitude of $\Delta$.
However, for fixed $X_0$ and $\beta$, as $X^TX$ becomes more severely multicollinear
$\mathrm{var}(X_0^T\hat\beta_{OLS}) = \sum_i X_{0i}^2/d_i$ will grow larger and $r$ will become smaller.
As the simulation suggests, when $r$ is small, using an appropriate
predictor, we can expect significant improvement over the OLS
predictor.
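
The effect of a shrinking $d_1$ on $\Delta$, $r$ and the OLS prediction variance can be illustrated with a few lines of Python. The function name and the particular $X_0$, $\beta$ and $d_i$ values are assumptions chosen for illustration, with $\sigma^2 = 1$ and $\beta^* = 0$.

```python
import numpy as np

def prediction_summaries(x0, beta, d):
    """Delta, r and var(X0' b_OLS) when X'X = diag(d), sigma^2 = 1, beta* = 0."""
    delta_cap = np.sum(d * beta**2)              # Delta = sum d_i beta_i^2
    var_ols = np.sum(x0**2 / d)                  # sum X0i^2 / d_i
    r = (x0 @ beta) ** 2 / (var_ols * delta_cap)
    return delta_cap, r, var_ols

x0 = np.array([1.0, 1.0, 1.0])                   # hypothetical X0 and beta
beta = np.array([1.0, 1.0, 1.0])
results = [prediction_summaries(x0, beta, np.array([d1, 1.0, 1.0]))
           for d1 in (1.0, 0.1, 0.01)]
r_values = [res[1] for res in results]
variances = [res[2] for res in results]
```

For these values $r$ falls from 1 as $d_1$ decreases while the variance of the OLS predictor rises, which is the pattern described above.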

4. Multiple Prediction

In concluding we offer several comments regarding multiple
or simultaneous prediction. Suppose we wish to make $r$ predictions
defined by $X_1, X_2, \ldots, X_r$ and we set $X^* = (X_1, \ldots, X_r)$. For
convenience we assume the $X_i$ are a linearly independent set whence
rank$(X^*) = r \le p$. What is an appropriate loss to employ? One
choice is unweighted sum of squared error loss, i.e.
$\sum_i (X_i^T(\hat\beta-\beta))^2 = (\hat\beta-\beta)^TG_1(\hat\beta-\beta)$ with $G_1 = X^*X^{*T}$. A second choice
arises from the joint distribution of the $X_i^T\hat\beta_{OLS}$, i.e.
$X^{*T}\hat\beta_{OLS} \sim N(X^{*T}\beta, \sigma^2X^{*T}(X^TX)^{-1}X^*)$ suggests $(\hat\beta-\beta)^TG_2(\hat\beta-\beta)$
where $G_2 = X^*(X^{*T}(X^TX)^{-1}X^*)^{-1}X^{*T}$. Others may be envisioned as
well. If $r < p$, $G_1$, $G_2$ are positive semi-definite whence, as
noted earlier for individual prediction, we may have loss equal
to 0 but $\hat\beta$ not close to $\beta$. Nonetheless, it is well-known that
if $P$ is any positive definite matrix $X^{*T}\hat\beta_{OLS}$ is admissible for
$X^{*T}\beta$ under loss $(\hat\beta-\beta)^TX^*PX^{*T}(\hat\beta-\beta)$ if $r \le 2$, inadmissible if
$r \ge 3$. In fact, work done by Berger (1976), Bhattacharya (1966),
Bock (1975), Casella (1977), Efron and Morris (1976) and
Strawderman (1978) leads to explicit minimax predictors which
improve upon $X^{*T}\hat\beta_{OLS}$. These predictors will be generalized
adaptive ridge of the form

$(I + A(X^{*T}\hat\beta_{OLS}, \hat\sigma^2, X, X^*, P))^{-1}X^{*T}\hat\beta_{OLS}$ (see e.g. Strawderman,
Theorem 6, p. 626, for a family of such predictors), paralleling,
in a sense, the individual predictors $X_0^T\hat\beta_{\hat a_0}$ and $X_0^T\hat\beta_{\hat c_0}$.
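
The two loss matrices $G_1$ and $G_2$ are straightforward to form, and a small Python sketch makes the semi-definiteness remark concrete. The function names, the simulated design $X$, and the particular $X^*$ and $\beta$ values are assumptions made for illustration.

```python
import numpy as np

def loss_matrices(X, X_star):
    """G1 = X* X*' and G2 = X* (X*'(X'X)^{-1}X*)^{-1} X*'."""
    G1 = X_star @ X_star.T
    S = X_star.T @ np.linalg.solve(X.T @ X, X_star)   # X*'(X'X)^{-1}X*
    G2 = X_star @ np.linalg.solve(S, X_star.T)
    return G1, G2

def quadratic_loss(G, beta_hat, beta):
    """(beta_hat - beta)' G (beta_hat - beta)."""
    d = beta_hat - beta
    return float(d @ G @ d)

rng = np.random.default_rng(0)
X = rng.standard_normal((15, 3))                 # hypothetical design, p = 3
X_star = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.0, 0.0]])                  # r = 2 prediction vectors
G1, G2 = loss_matrices(X, X_star)

beta = np.array([1.0, 2.0, 3.0])
beta_far = beta + np.array([0.0, 0.0, 5.0])      # differs only off the span of X*
loss1 = quadratic_loss(G1, beta_far, beta)
loss2 = quadratic_loss(G2, beta_far, beta)
```

Both losses vanish for `beta_far` even though it lies at Euclidean distance 5 from `beta`, which is the multiple-prediction analogue of the pseudometric remark made for a single $X_0$.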

As Strawderman notes (p. 626) there is no one predictor
which will dominate the OLS predictor for all $P$. For $P = I$ (i.e.
$G_1$) a very simple procedure is to use estimates of $\Delta$ and $r$ to
select a good predictor for each $X_i$. If we are not prepared to
specify $P$ the simulation study suggests that using $\hat\beta_{p-2}$ (i.e.
$c = \hat\sigma^2(p-2)$ in (4)) regardless of $X_i$ is a simple but perhaps
adequate choice.

Footnotes


A possible refinement to using the predictor defined by
$\hat\beta_S$ with $c = \hat\sigma^2(p-2)$ would employ the "limited translation"
approach as discussed in Efron and Morris (1972, p. 136).
Limiting the amount of shift for each coordinate of $\hat\beta_{OLS}$
toward the corresponding coordinate of $\hat\beta_S$ using a relevance
function, $\rho$, leads to an estimator $\hat\beta_\rho$ and resulting predictor
$X_0^T\hat\beta_\rho$. We also note that the estimator, $\hat\beta_S$, resulting from
James and Stein (1961, p. 366) sets $c = \hat\sigma^2(p-2)(n-p+2)^{-1}(n-p)$.
With this $c$ (19) becomes

$\sigma^2X_0^T(X^TX)^{-1}X_0\{1 + (n-p+2)^{-1}(n-p)(p^2-4)\Gamma_1 - 2(n-p+2)^{-1}(n-p)(p-2)\Gamma_2\}$.


1. We recognize that these simplifying assumptions diminish the
utility of the simulation study. In particular, certain
estimators which differ in a more general model become
equivalent. The study is only intended to be illustrative
and suggestive. Certainly a more elaborate one might be
undertaken. We also recognize that with these simplifications
expressions for the exact risks of some of the predictors
below (ignoring restrictions) may be obtained using (17).


Figure 1: Risk of $(\hat\gamma^2+1)^{-1}\hat\gamma^3$.

*Figures in parentheses indicate range of expected percent improvement over
OLS using "best" predictor.

**Are indistinguishable.

ABSTRACT

For the usual regression model with fixed regressors, there is a
considerable literature devoted to alternatives to ordinary least squares
estimators of the regression parameters. These alternatives are biased with
"small" variances resulting in reduced mean square error over some (perhaps
all) of the parameter space. Two prominent classes of such estimators are
ridge-type and Stein-type estimators.

Consider the simplest prediction problem in this context, i.e. prediction
at a single new vector of prediction values. We calculate the risk (squared
error) for predictors based on estimators in the above families. While the
ordinary least squares predictor is admissible, a simulation study reveals
that over regions of the parameter space substantial reduction in risk is
possible using estimators in these families. A simple preliminary procedure
based upon the vector of prediction values is given to select a "good"
estimator from these families. It is apparent that in multiple prediction a
single choice of estimator need not be best.


